{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 1. Introduction\n",
    "This notebook demonstrates a large code cell with multiple functions.\n",
    "We will later update a few lines in the middle of that big cell."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "\n",
    "# Below is a large cell with 100+ lines of code.\n",
    "# We'll define multiple functions for demonstration.\n",
    "# Lines are intentionally verbose/filler to reach 100+.\n",
    "\n",
    "def load_data() -> pd.DataFrame:\n",
    "    \"\"\"\n",
    "    Simulate loading data from a source.\n",
    "\n",
    "    This function creates a pandas DataFrame with some sample data.\n",
    "    The data includes three columns:\n",
    "    - 'A': Contains integer values with one missing value.\n",
    "    - 'B': Contains integer values with one missing value.\n",
    "    - 'C': Contains categorical values with one missing value.\n",
    "\n",
    "    The purpose of this function is to simulate the process of loading data\n",
    "    from an external source such as a CSV file, database, or API. In a real-world\n",
    "    scenario, this function would contain the logic to read data from the actual\n",
    "    data source and return it as a pandas DataFrame.\n",
    "\n",
    "    Returns:\n",
    "        pd.DataFrame: A DataFrame containing the sample data with columns 'A', 'B', and 'C'.\n",
    "    \"\"\"\n",
    "    data = {\n",
    "        'A': [1, 2, 3, None, 5],\n",
    "        'B': [10, 9, None, 7, 6],\n",
    "        'C': ['x', 'y', None, 'y', 'x']\n",
    "    }\n",
    "    df = pd.DataFrame(data)\n",
    "    return df\n",
    "\n",
    "def clean_data(df: pd.DataFrame) -> pd.DataFrame:\n",
    "    \"\"\"\n",
    "    Clean the dataframe by filling missing values and performing additional cleaning steps.\n",
    "\n",
    "    This function takes a pandas DataFrame as input and performs the following cleaning operations:\n",
    "    1. Identifies numeric columns and fills missing values with the mean of the respective column.\n",
    "    2. Identifies categorical columns and fills missing values with the string 'missing'.\n",
    "    3. Replaces any negative values in column 'B' with 0.\n",
    "\n",
    "    The purpose of this function is to ensure that the DataFrame is free of missing values and\n",
    "    any negative values in column 'B' are handled appropriately. This is a common step in data\n",
    "    preprocessing to prepare the data for further analysis or modeling.\n",
    "\n",
    "    Args:\n",
    "        df (pd.DataFrame): The input DataFrame that needs to be cleaned.\n",
    "\n",
    "    Returns:\n",
    "        pd.DataFrame: The cleaned DataFrame with no missing values and negative values in column 'B' replaced with 0.\n",
    "    \"\"\"\n",
    "    # Some placeholder logic for cleaning.\n",
    "    # Let's say we fill numeric columns with mean.\n",
    "    numeric_cols = df.select_dtypes(include=[np.number]).columns\n",
    "    for col in numeric_cols:\n",
    "        df[col] = df[col].fillna(df[col].mean())\n",
    "\n",
    "    # Fill categorical columns with a placeholder\n",
    "    object_cols = df.select_dtypes(include=['object']).columns\n",
    "    for col in object_cols:\n",
    "        df[col] = df[col].fillna('missing')\n",
    "\n",
    "    # Additional cleaning steps:\n",
    "    # Replace negative values in column B with 0.\n",
    "    if 'B' in df.columns:\n",
    "        df.loc[df['B'] < 0, 'B'] = 0\n",
    "\n",
    "    # return the result\n",
    "    return df\n",
    "\n",
    "def transform_data(df: pd.DataFrame) -> pd.DataFrame:\n",
    "    \"\"\"\n",
    "    Perform data transformation on the DataFrame.\n",
    "\n",
    "    This function takes a pandas DataFrame as input and performs several transformation operations:\n",
    "    1. Creates a new column 'A_squared' which contains the square of the values in column 'A'.\n",
    "    2. Creates a new column 'B_plus_one' which contains the values in column 'B' incremented by 1.\n",
    "\n",
    "    The purpose of this function is to enhance the DataFrame with additional features that may be useful\n",
    "    for further analysis or modeling. By creating new columns based on existing data, we can provide\n",
    "    more information to downstream processes and potentially improve the performance of machine learning models.\n",
    "\n",
    "    Args:\n",
    "        df (pd.DataFrame): The input DataFrame that needs to be transformed.\n",
    "\n",
    "    Returns:\n",
    "        pd.DataFrame: The transformed DataFrame with new columns 'A_squared' and 'B_plus_one'.\n",
    "    \"\"\"\n",
    "    # Some transformation placeholder\n",
    "    df['A_squared'] = df['A'] ** 2\n",
    "    df['B_plus_one'] = df['B'] + 1\n",
    "    return df\n",
    "\n",
    "def feature_engineer(df: pd.DataFrame) -> pd.DataFrame:\n",
    "    \"\"\"\n",
    "    Perform additional feature engineering on the DataFrame.\n",
    "\n",
    "    This function takes a pandas DataFrame as input and performs several additional feature engineering operations:\n",
    "    1. Creates a new column 'AB_ratio' which contains the ratio of the values in column 'A' to the values in column 'B'.\n",
    "       - To avoid division by zero, a small constant (0.001) is added to the denominator.\n",
    "    2. Creates a new column 'is_x' which is a binary indicator:\n",
    "       - The value is 1 if the corresponding value in column 'C' is 'x'.\n",
    "       - The value is 0 otherwise.\n",
    "\n",
    "    The purpose of this function is to create new features that may provide additional insights or improve the performance\n",
    "    of machine learning models. By engineering new features based on existing data, we can capture more complex relationships\n",
    "    and patterns that may not be immediately apparent from the original columns.\n",
    "\n",
    "    Args:\n",
    "        df (pd.DataFrame): The input DataFrame that needs further feature engineering.\n",
    "\n",
    "    Returns:\n",
    "        pd.DataFrame: The DataFrame with additional engineered features 'AB_ratio' and 'is_x'.\n",
    "    \"\"\"\n",
    "    df['AB_ratio'] = df['A'] / (df['B'] + 0.001)\n",
    "    df['is_x'] = df['C'].apply(lambda x: 1 if x == 'x' else 0)\n",
    "    return df\n",
    "\n",
    "def run_pipeline():\n",
    "    \"\"\"\n",
    "    Orchestrates all steps of the data processing pipeline.\n",
    "\n",
    "    This function serves as the main entry point for executing the entire data processing pipeline.\n",
    "    It performs the following steps in sequence to transform raw data into a cleaned and feature-engineered DataFrame:\n",
    "\n",
    "    1. Load Data:\n",
    "       - Calls the `load_data` function to simulate loading data from an external source.\n",
    "       - The data is loaded into a pandas DataFrame with columns 'A', 'B', and 'C'.\n",
    "       - The initial data contains some missing values and is used as the starting point for further processing.\n",
    "\n",
    "    2. Clean Data:\n",
    "       - Calls the `clean_data` function to clean the loaded DataFrame.\n",
    "       - Missing values in numeric columns are filled with the mean of the respective column.\n",
    "       - Missing values in categorical columns are filled with the string 'missing'.\n",
    "       - Any negative values in column 'B' are replaced with 0.\n",
    "       - The cleaned DataFrame is returned for the next step.\n",
    "\n",
    "    3. Transform Data:\n",
    "       - Calls the `transform_data` function to perform data transformation on the cleaned DataFrame.\n",
    "       - A new column 'A_squared' is created, containing the square of the values in column 'A'.\n",
    "       - A new column 'B_plus_one' is created, containing the values in column 'B' incremented by 1.\n",
    "       - The transformed DataFrame is returned for the next step.\n",
    "\n",
    "    4. Feature Engineer:\n",
    "       - Calls the `feature_engineer` function to perform additional feature engineering on the transformed DataFrame.\n",
    "       - A new column 'AB_ratio' is created, containing the ratio of the values in column 'A' to the values in column 'B'.\n",
    "         - To avoid division by zero, a small constant (0.001) is added to the denominator.\n",
    "       - A new column 'is_x' is created, which is a binary indicator:\n",
    "         - The value is 1 if the corresponding value in column 'C' is 'x'.\n",
    "         - The value is 0 otherwise.\n",
    "       - The feature-engineered DataFrame is returned as the final output.\n",
    "\n",
    "    Returns:\n",
    "        pd.DataFrame: The final DataFrame after all processing steps, including cleaning, transformation, and feature engineering.\n",
    "    \"\"\"\n",
    "    data = load_data()\n",
    "    data = clean_data(data)\n",
    "    data = transform_data(data)\n",
    "    data = feature_engineer(data)\n",
    "    return data"
   ]
  },
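  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`clean_data` above assigns into the DataFrame it receives, so the caller's object changes too.\n",
    "The next cell is a minimal sketch of a non-mutating variant, in case that side effect is unwanted.\n",
    "`clean_data_copy` is a hypothetical helper introduced only for this illustration; it relies on `pd`, `load_data`, and `clean_data` from the cell above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def clean_data_copy(df: pd.DataFrame) -> pd.DataFrame:\n",
    "    \"\"\"Hypothetical helper: clean a copy so the caller's DataFrame is left untouched.\"\"\"\n",
    "    return clean_data(df.copy())\n",
    "\n",
    "raw = load_data()\n",
    "cleaned = clean_data_copy(raw)\n",
    "\n",
    "# Missing-value counts per column: `raw` keeps its gaps, `cleaned` is the filled copy.\n",
    "pd.DataFrame({'raw': raw.isna().sum(), 'cleaned': cleaned.isna().sum()})"
   ]
  },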
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 3. Usage Example\n",
    "df_pipeline = run_pipeline()\n",
    "df_pipeline"
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}